Achieving Scalable Automated Diagnosis of Distributed Systems Performance Problems

Authors

  • Chengdu Huang
  • Ira Cohen
  • Julie Symons
  • Tarek Abdelzaher
Abstract

Distributed systems continue to grow in scale and complexity, resulting in increasingly involved interactions among components and increasingly intricate failure modes that are very hard to diagnose manually. This increased vulnerability of larger systems, together with the increased difficulty of failure diagnosis, has motivated machine learning approaches to automate the diagnosis task. Although preliminary results are encouraging, scaling up the existing approaches to large applications remains challenging. As scale increases, current approaches suffer from the curse of dimensionality, exacerbated by the exploding set of system states and measured metrics. In this paper, we significantly improve the scalability of performance diagnosis methods. Our contributions lie in the use of (i) an intelligent partitioning of the metric space, coupled with a cooperative temporal segmentation algorithm, dividing system observations in time and in space to remove the multiplicative explosion of system states, and (ii) transfer learning techniques that improve accuracy by leveraging dependencies among the partitions. We validate our approaches on several months of production traces from a customer-facing, geographically distributed, 24×7, 3-tier internet service. Our results show a significant accuracy improvement (35% on average) over the naive partitioning of the state space (without the new temporal segmentation algorithm or transfer learning), and an order of magnitude reduction in computational cost over the “brute force” approach of learning with no partitioning, without loss of accuracy.

Similar resources

Dynamic configuration and collaborative scheduling in supply chains based on scalable multi-agent architecture

Due to diversified and frequently changing demands from customers, technological advances and global competition, manufacturers rely on collaboration with their business partners to share costs, risks and expertise. How to take advantage of advancement of technologies to effectively support operations and create competitive advantage is critical for manufacturers to survive. To respond to these...

Data Replication-Based Scheduling in Cloud Computing Environment

Abstract— High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...

Access control in ultra-large-scale systems using a data-centric middleware

The primary characteristic of an Ultra-Large-Scale (ULS) system is ultra-large size on any related dimension. A ULS system is generally considered as a system-of-systems with heterogeneous nodes and autonomous domains. As the size of a system-of-systems grows, and interoperability demand between sub-systems is increased, achieving more scalable and dynamic access control system becomes an im...

Static Task Allocation in Distributed Systems Using Parallel Genetic Algorithm

Over the past two decades, PC speeds have increased from a few instructions per second to several million instructions per second. The tremendous speed of today's networks as well as the increasing need for high-performance systems has made researchers interested in parallel and distributed computing. The rapid growth of distributed systems has led to a variety of problems. Task allocation is a...

Achieving Scalable Cluster System Analysis and Management with a Gossip-Based Network Service

Clusters of workstations are increasingly used for applications requiring high levels of both performance and reliability. Certain fundamental services are highly desirable to achieve these twin goals of network-based cluster system analysis and management. Among these services is the ability to detect network and node failures and the capability to efficiently determine computer and network lo...



Publication date: 2007